METR is a research nonprofit that works on assessing whether cutting-edge AI systems could pose catastrophic risks to society.
We build the science of accurately assessing risks, so that humanity is informed before developing transformative AI systems.
AI’s Transformative Potential
We believe that AI could change the world quickly and drastically, with potential for both enormous good and enormous harm. We also believe it’s hard to predict exactly when and how this might happen.
At some point, AIs will probably be able to do most of what humans can do, including developing new technologies, starting businesses and making money, finding new cybersecurity exploits and fixes, and more.
We think it’s very plausible that AI systems could end up pursuing goals that are at odds with a thriving civilization. This could result from a deliberate effort to cause chaos, or it could happen despite the intention to develop only AI systems that are safe. [1]
Given how quickly things could play out, we don’t think it’s good enough to “wait and see” whether there are dangers.
We believe in vigilantly, continually assessing risks. If an AI brings significant risk of a global catastrophe, the decision to develop and/or release it can’t lie only with the company that creates it.
Partnerships
We have previously worked with Anthropic, OpenAI, and other companies to pilot some informal pre-deployment evaluation procedures. These companies have also given us some kinds of non-public access and provided compute credits to support evaluation research.
We think it’s important for there to be third-party evaluators with formal arrangements and access commitments – both for evaluating new frontier models before they are scaled up or deployed, and for conducting research to improve evaluations.
We do not yet have such arrangements, but we are excited about taking more steps in this direction.
We are also partnering with the UK AI Safety Institute and are part of the NIST AI Safety Institute Consortium.
Our Work
Our research is focused on how to evaluate AI systems for dangerous capabilities:
A standard way to define tasks for evaluating the capabilities of AI agents.
[1] There are reasons to expect goal-directed behavior to emerge in AI systems, and to expect that superficial attempts to align ML systems will result in sycophantic or deceptive behavior – “playing the training game” – rather than successful alignment.